refactor: Extract `RegexDFAState` class, `RegexDFAStatePair` class, and `RegexDFAStateType` enum into their own files. #57

SharafMohamed · 2024-12-05T16:13:05Z

Running git diff locally gives better comparisons for renamed files than the GitHub UI:

git fetch upstream
git fetch upstream pull/57/head:pr-57
git diff upstream/main pr-57

Description

Separates DFA functionality into different files:

- RegexDFAState class moved into its own file.
- RegexDFAStatePair class moved into its own file.
- RegexDFAStateType enum moved into its own file.

Validation performed

Previously existing tests still succeed.

Summary by CodeRabbit

Release Notes

New Features
- Introduced new header files for managing DFA states and their types.
- Added functionality for creating and managing DFA states, including methods for transitions and acceptance checks.
- Implemented logic for computing intersections of DFAs.
Bug Fixes
- Streamlined the DFA state management by consolidating redundant structures.
Documentation
- Updated documentation to reflect new classes and methods for better clarity on DFA state management.

…ake it clear to the reader that both failures are handled the same way and return nullopt. For more complicated return cases it would warrant the reader looking at the doc for the individual functions, but here I think we can make their life easier.

Co-authored-by: Lin Zhihao <[email protected]>

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (3)

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)
35-35: Pass small integral type uint8_t by value instead of by const reference

The parameter byte in add_byte_transition is a uint8_t, which is more efficiently passed by value rather than by const reference. Passing small integral types by value avoids unnecessary indirection.

Apply this diff to pass byte by value:
-auto add_byte_transition(uint8_t const& byte, RegexDFAState<stateType>* dest_state) -> void {
+auto add_byte_transition(uint8_t byte, RegexDFAState<stateType>* dest_state) -> void {

src/log_surgeon/finite_automata/RegexDFA.hpp (1)
65-65: Use idiomatic condition in the while loop

Consider using !unvisited_pairs.empty() instead of false == unvisited_pairs.empty() for better readability and to align with common coding practices.

Apply this diff:
-    while (false == unvisited_pairs.empty()) {
+    while (!unvisited_pairs.empty()) {

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

70-80: Offer assistance to implement UTF-8 transitions

The method get_reachable_pairs currently handles only single-byte transitions, as indicated by the TODO comment. Supporting UTF-8 (multi-byte) transitions is essential for full Unicode compatibility.

I can help implement support for UTF-8 transitions in this method. Would you like me to provide a solution or open a GitHub issue to track this enhancement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 99b5b08 and 66ed13b.

📒 Files selected for processing (6)

CMakeLists.txt (1 hunks)
src/log_surgeon/finite_automata/RegexDFA.hpp (2 hunks)
src/log_surgeon/finite_automata/RegexDFA.tpp (0 hunks)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1 hunks)

💤 Files with no reviewable changes (1)

src/log_surgeon/finite_automata/RegexDFA.tpp

✅ Files skipped from review due to trivial changes (1)

src/log_surgeon/finite_automata/RegexDFAStateType.hpp

🧰 Additional context used

📓 Learnings (3)

CMakeLists.txt (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.

src/log_surgeon/finite_automata/RegexDFA.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.

🔇 Additional comments (1)

CMakeLists.txt (1)

100-102: Addition of new source files is correct

The inclusion of the new source files RegexDFAState.hpp, RegexDFAStatePair.hpp, and RegexDFAStateType.hpp into the build system is appropriate. This ensures that the newly created classes are compiled and linked properly.

src/log_surgeon/finite_automata/RegexDFAState.hpp

src/log_surgeon/finite_automata/RegexDFA.hpp

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

src/log_surgeon/finite_automata/RegexDFA.hpp (2)

12-14: Consider addressing the TODO comment in this PR

The TODO suggests renaming RegexDFA to DFA. Since this PR is already refactoring the DFA-related code, it might be a good opportunity to implement this naming change as well.

Would you like me to help with renaming the class and updating all its references?

60-60: Track the UTF-8 handling requirement

The TODO comment indicates missing functionality for handling UTF-8 multi-byte transitions.

Would you like me to create a GitHub issue to track the UTF-8 handling requirement?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 66ed13b and 176391b.

📒 Files selected for processing (2)

src/log_surgeon/finite_automata/RegexDFA.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/log_surgeon/finite_automata/RegexDFAState.hpp

🧰 Additional context used

📓 Path-based instructions (1)

src/log_surgeon/finite_automata/RegexDFA.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexDFA.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🔇 Additional comments (3)

src/log_surgeon/finite_automata/RegexDFA.hpp (3)

27-34: LGTM! Parameter type changed as suggested

The method signature has been updated to use a raw pointer instead of a unique_ptr, which is more appropriate for this use case where ownership transfer isn't needed.

40-52: LGTM! Clean implementation of state creation

The implementation properly manages memory using smart pointers and correctly handles the transfer of matching variable IDs from NFA to DFA states.

54-72: LGTM! Well-structured intersection algorithm

The implementation:

- Correctly manages state pair traversal
- Properly collects matching variable IDs
- Follows coding guidelines with false == unvisited_pairs.empty()
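For readers who have not seen this kind of traversal, the following is a minimal sketch of a worklist over state pairs, written in the `false == ...` style the guidelines ask for. `ToyState`, `ToyDfa`, `cNoTransition`, and `intersect_matching_ids` are hypothetical names for illustration and are not log-surgeon's API. The sketch assumes byte-only transitions, that state 0 is the root, and that ids are collected from the first DFA whenever both states of a pair are accepting.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration only; not log-surgeon's RegexDFA API.
constexpr int cNoTransition = -1;

struct ToyState {
    ToyState() { next.fill(cNoTransition); }

    std::array<int, 256> next;                         // Destination state index per byte.
    std::vector<std::uint32_t> matching_variable_ids;  // Non-empty means accepting.
};

using ToyDfa = std::vector<ToyState>;  // Assumption: state 0 is the root.

// Walks both DFAs in lockstep from their roots. Whenever both states of a pair are accepting,
// the ids of the state from `lhs` are collected.
inline auto intersect_matching_ids(ToyDfa const& lhs, ToyDfa const& rhs)
        -> std::set<std::uint32_t> {
    std::set<std::uint32_t> ids;
    if (lhs.empty() || rhs.empty()) {
        return ids;
    }
    std::set<std::pair<int, int>> visited_pairs;
    std::set<std::pair<int, int>> unvisited_pairs{{0, 0}};
    while (false == unvisited_pairs.empty()) {
        auto const current_pair = *unvisited_pairs.begin();
        unvisited_pairs.erase(unvisited_pairs.begin());
        visited_pairs.insert(current_pair);

        auto const& state1 = lhs[static_cast<std::size_t>(current_pair.first)];
        auto const& state2 = rhs[static_cast<std::size_t>(current_pair.second)];
        if (false == state1.matching_variable_ids.empty()
            && false == state2.matching_variable_ids.empty())
        {
            ids.insert(state1.matching_variable_ids.begin(), state1.matching_variable_ids.end());
        }

        // A pair is reachable only if both DFAs have a transition on the same byte.
        for (std::size_t byte = 0; byte < 256; ++byte) {
            int const next1 = state1.next[byte];
            int const next2 = state2.next[byte];
            if (cNoTransition == next1 || cNoTransition == next2) {
                continue;
            }
            std::pair<int, int> const reachable_pair{next1, next2};
            if (0 == visited_pairs.count(reachable_pair)) {
                unvisited_pairs.insert(reachable_pair);
            }
        }
    }
    return ids;
}
```

Keeping a visited set alongside the worklist is what guarantees termination when either DFA contains a cycle, since each pair of states is expanded at most once.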

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

55-55: Optimise memory usage by conditionally including m_bytes_transition

The member m_bytes_transition is always included in RegexDFAState, even when stateType is UTF8, which can lead to unnecessary memory usage. Consider conditionally including m_bytes_transition only when stateType is Byte, similar to how m_tree_transitions is conditionally included.
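One way the suggestion could look, as a sketch only: select the member's type with `std::conditional_t` so that non-byte states carry an empty placeholder instead of a 256-entry table. `ToyStateType` and `ToyDfaState` are hypothetical stand-ins, and using `std::tuple<>` as the empty placeholder is an assumption about how such a member might be disabled, not a statement about the actual header.

```cpp
#include <array>
#include <cstdint>
#include <tuple>
#include <type_traits>

// Hypothetical stand-ins for illustration; not the real RegexDFAStateType/RegexDFAState.
enum class ToyStateType : std::uint8_t {
    Byte,
    UTF8
};

template <ToyStateType state_type>
class ToyDfaState {
public:
    auto add_byte_transition(std::uint8_t byte, ToyDfaState* dest_state) -> void {
        if constexpr (ToyStateType::Byte == state_type) {
            m_bytes_transition[byte] = dest_state;
        }
        // In this sketch, UTF8 states simply ignore byte transitions.
    }

    [[nodiscard]] auto next(std::uint8_t byte) const -> ToyDfaState* {
        if constexpr (ToyStateType::Byte == state_type) {
            return m_bytes_transition[byte];
        } else {
            return nullptr;  // Placeholder: a real UTF8 state would consult another structure.
        }
    }

private:
    using ByteTable = std::array<ToyDfaState*, 256>;

    // Only Byte states pay for the table; other states hold an empty tuple.
    std::conditional_t<ToyStateType::Byte == state_type, ByteTable, std::tuple<>>
            m_bytes_transition{};
};
```

An empty member still occupies at least one byte unless something like C++20's `[[no_unique_address]]` is used, so the saving is the table itself rather than the whole member.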

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

71-71: Implement UTF-8 handling in get_reachable_pairs

There is a TODO comment indicating that UTF-8 (multi-byte transitions) are not yet handled in the get_reachable_pairs method. Implementing this functionality is important for full UTF-8 support. Would you like assistance in adding this implementation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 176391b and 0decaf5.

📒 Files selected for processing (4)

examples/intersect-test.cpp (1 hunks)
src/log_surgeon/Lexer.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexDFAState.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1 hunks)

✅ Files skipped from review due to trivial changes (1)

src/log_surgeon/Lexer.hpp

🧰 Additional context used

📓 Path-based instructions (3)

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexDFAState.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

🔇 Additional comments (2)

src/log_surgeon/finite_automata/RegexDFAState.hpp (1)

63-65: ⚠️ Potential issue

Add bounds check on character in Byte state

Accessing m_bytes_transition[character] without validating character may lead to out-of-bounds access if character is greater than or equal to cSizeOfByte. Consider adding an assertion to ensure character is within bounds to prevent potential errors.

Apply this diff to add the assertion:

 if constexpr (RegexDFAStateType::Byte == stateType) {
+    assert(character < cSizeOfByte);
     return m_bytes_transition[character];
 }
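To show the suggested check in a complete, compilable form, here is a small sketch of a byte-indexed lookup guarded by an assertion. `ToyByteState` and `cToySizeOfByte` are hypothetical names, and the `std::uint32_t` parameter type is an assumption made so that an out-of-range value is possible at all.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cSizeOfByte.
constexpr std::uint32_t cToySizeOfByte = 256;

// Hypothetical byte-only state; not the real RegexDFAState.
class ToyByteState {
public:
    auto add_byte_transition(std::uint8_t byte, ToyByteState* dest_state) -> void {
        m_bytes_transition[byte] = dest_state;
    }

    // `character` is wider than a byte, so the index is checked before it is used.
    [[nodiscard]] auto next(std::uint32_t character) const -> ToyByteState* {
        assert(character < cToySizeOfByte);
        return m_bytes_transition[character];
    }

private:
    std::array<ToyByteState*, cToySizeOfByte> m_bytes_transition{};
};
```

Note that `assert` is compiled out when `NDEBUG` is defined, so it documents and checks the precondition in debug builds only; a release build would need an explicit check if callers cannot be trusted.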

examples/intersect-test.cpp (1)

45-45: Function call to get_intersect updated correctly

The get_intersect function call has been correctly updated to pass a raw pointer using dfa2.get(), aligning with the updated function signature. This ensures proper functionality.
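The calling pattern described here can be sketched as follows. `ToyDfa` is a hypothetical class whose `get_intersect` merely intersects two id sets; the real function's parameters and return type are not reproduced. The point is only that a callee which reads the other object can take a non-owning `const*`, while the caller keeps its `std::unique_ptr` and passes `.get()`.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <memory>
#include <set>
#include <utility>

// Hypothetical class for illustration; not log-surgeon's RegexDFA.
class ToyDfa {
public:
    explicit ToyDfa(std::set<std::uint32_t> ids) : m_ids(std::move(ids)) {}

    // Non-owning pointer parameter: the callee only reads `other`, so ownership stays with
    // whoever holds the unique_ptr.
    [[nodiscard]] auto get_intersect(ToyDfa const* other) const -> std::set<std::uint32_t> {
        std::set<std::uint32_t> shared_ids;
        if (nullptr == other) {
            return shared_ids;
        }
        std::set_intersection(
                m_ids.begin(),
                m_ids.end(),
                other->m_ids.begin(),
                other->m_ids.end(),
                std::inserter(shared_ids, shared_ids.end())
        );
        return shared_ids;
    }

private:
    std::set<std::uint32_t> m_ids;
};
```

A caller holding `auto dfa2 = std::make_unique<ToyDfa>(...)` would write `dfa1->get_intersect(dfa2.get())`, and `dfa2` remains valid and owned by the caller afterwards.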

LinZhihao-723

The refactoring looks good to me. Let's agree on the macro naming, and then we can merge.

LinZhihao-723 · 2024-12-10T04:12:34Z

src/log_surgeon/finite_automata/RegexDFAState.hpp

@@ -0,0 +1,80 @@
+#ifndef LOG_SURGEON_FINITE_AUTOMATA_REGEX_DFA_STATE


LinZhihao-723 · Dec 10, 2024

Sorry for missing this in previous refactor PRs: I think we should name macros to exactly match the file name, so this header should be LOG_SURGEON_FINITE_AUTOMATA_REGEXDFASTATE instead. We can create an issue to keep track of this and fix them all together later.

SharafMohamed · Dec 11, 2024

kk sounds good, I'll create the issue. I was previously separating words on capitalization, e.g. log_surgeon/finite_automata/DfaState would use #ifndef LOG_SURGEON_FINITE_AUTOMATA_DFA_STATE as the correct snake_case naming for the separate words (as we're combining snake_case folder names and CamelCase file names).

Issue created.
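To make the difference between the two naming schemes concrete, here is a small sketch that derives a guard macro from a header path (without extension) under each scheme. Both helper names are hypothetical, and the exact splitting rule for acronyms such as `DFA` is an assumption chosen so that the output matches the guards quoted in this thread.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

namespace toy_guard_detail {
inline auto is_upper(char c) -> bool { return 0 != std::isupper(static_cast<unsigned char>(c)); }

inline auto is_lower(char c) -> bool { return 0 != std::islower(static_cast<unsigned char>(c)); }

inline auto to_upper(char c) -> char {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
}  // namespace toy_guard_detail

// Scheme proposed in the review: uppercase the path as-is, turning '/' into '_'.
inline auto guard_matching_file_name(std::string_view path) -> std::string {
    std::string guard;
    for (char const c : path) {
        guard.push_back('/' == c ? '_' : toy_guard_detail::to_upper(c));
    }
    return guard;
}

// Scheme used previously: additionally insert '_' at word boundaries inside CamelCase names.
inline auto guard_split_on_capitalization(std::string_view path) -> std::string {
    using toy_guard_detail::is_lower;
    using toy_guard_detail::is_upper;
    std::string guard;
    for (std::size_t i = 0; i < path.size(); ++i) {
        char const c = path[i];
        if ('/' == c) {
            guard.push_back('_');
            continue;
        }
        if (i > 0 && is_upper(c)) {
            bool const follows_lower = is_lower(path[i - 1]);
            // Assumed acronym rule: "DFAState" splits before the 'S' that starts "State".
            bool const ends_acronym
                    = is_upper(path[i - 1]) && i + 1 < path.size() && is_lower(path[i + 1]);
            if (follows_lower || ends_acronym) {
                guard.push_back('_');
            }
        }
        guard.push_back(toy_guard_detail::to_upper(c));
    }
    return guard;
}
```

The first scheme needs no word-boundary rule at all, which is one practical argument for matching the file name exactly.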

LinZhihao-723

The PR title looks good to me.
The macro naming issue is tracked here: #65

SharafMohamed and others added 30 commits October 24, 2024 11:48

Have internal serialize() functions for RegexNFA (states and tagged transitions) return nullopt if state_ids is malformed.

e8db277

Reserve space during BFS; Run linter.

337cead

Add braced initialization to nfa.

4a30fdc

Co-authored-by: Lin Zhihao <[email protected]>

Update docstring for positive tag serialization.

0203038

Co-authored-by: Lin Zhihao <[email protected]>

Update docstring for negative tag serialization.

633acc4

Co-authored-by: Lin Zhihao <[email protected]>

Use return statement for full docstring of get_bfs_traversal_order.

4db7b82

Co-authored-by: Lin Zhihao <[email protected]>

Update NFA serialize() docstring.

01f8b14

Co-authored-by: Lin Zhihao <[email protected]>

Add long form of BFS for first use.

d047624

Use const for state_id_it.

f9c4f46

Co-authored-by: Lin Zhihao <[email protected]>

Update docstring for NFA state serialize.

bd77c78

Co-authored-by: Lin Zhihao <[email protected]>

Use const for state_id_it.

4cb560f

Co-authored-by: Lin Zhihao <[email protected]>

For NFA state serialize flip order of failure checks to reduce indentation.

95b7497

Merge branch 'tagged-nfa-new' of https://github.com/SharafMohamed/log-surgeon into tagged-nfa-new

e187445

Use const& for passing rules into the NFA as rules are never stored, nor are parts of the rules stored, instead the rules are only read and used to build the NFA.

8b85511

Use braced initialization for NFA.

0756794

Co-authored-by: Lin Zhihao <[email protected]>

Remove warning for not check std::optional when we know the function call succeeds in NFA's serialize.

6ab439a

Co-authored-by: Lin Zhihao <[email protected]>

Remove redundant initialzation of member variables in tagged transition classes when they are initialized in their constructor.

9244812

Use member initialization lists for constructing NFA state from tagged transitions instead of emplace back.

0d151a4

Switch to using optional prefix for optional return types.

ac63713

Make negative tagged transition singular as you can never have more than one leaving an NFA state.

b57b93f

Add missing param for new_state_with_negative_tagged_transitions.

c3fb16d

Move RegexNFAStateType, RegexNFAState, and PositiveTaggedTransition/NegativeTaggedTransition classes into their own files.

8a41367

Add tag class.

d1a57e4

Make tag an object with name, start, and end information, instead of just an id. This object is created and owned by the capture AST, and other AST and NFA states point to these tags.

bc78f59

Run linter.

ac7260f

Merge branch 'main' into singular-negative-transition

40a8206

Change t to curr_state and u to dest_state.

c2eea21

Change curr_state to current_state; Remove extraneous *; Add newline that was accidentally removed.

629fce9

Add TODO for utf8 case in BFS.

aed62b2

SharafMohamed added 4 commits December 5, 2024 18:22

Add docstring for get_parent_id_unsafe().

8abf35a

Linter.

1e5fdcc

Merge branch 'register' into individual-dfa-files

71d926d

Merge branch 'main' into individual-dfa-files

66ed13b

coderabbitai bot reviewed Dec 6, 2024

src/log_surgeon/finite_automata/RegexDFAState.hpp (resolved)

src/log_surgeon/finite_automata/RegexDFAState.hpp (outdated, resolved)

src/log_surgeon/finite_automata/RegexDFA.hpp (outdated, resolved)

SharafMohamed changed the title from "refactor: Separate DFA functionality into different files." to "refactor: Extract RegexDFAState class, RegexDFAStatePair class, and RegexDFAStateType enum into their own files." on Dec 6, 2024

SharafMohamed requested a review from LinZhihao-723 December 6, 2024 23:38

SharafMohamed added 3 commits December 6, 2024 18:47

Fix comment length.

a12a360

Initialize byte transitions.

244d122

Use const* in place of unique_ptr reference; Update docstrings.

176391b

coderabbitai bot reviewed Dec 7, 2024

SharafMohamed added 7 commits December 6, 2024 19:20

Update intersect test to compile.

012f61f

Update next() docstring.

96a6363

Update headers.

421c3de

Update Lexer headers.

1b945a1

Add header for conditional_t.

78c4125

Linter.

33623fa

Change ! to false ==.

0decaf5

coderabbitai bot reviewed Dec 8, 2024

LinZhihao-723 requested changes Dec 10, 2024

LinZhihao-723 approved these changes Dec 11, 2024

LinZhihao-723 merged commit 081b20f into y-scope:main Dec 11, 2024
9 checks passed

This was referenced Dec 11, 2024

refactor: Update NFA and DFA headers to align with the latest coding guidelines. #58

Merged

refactor: Add functionality for tagged DFA. #62

Closed

feat: Add register handler to the Dfa class. #66

Closed

LinZhihao-723 mentioned this pull request Dec 19, 2024

refactor: Standardize header guard macros. #65

Open
